Description of the Problem

We were given 30 State of the Nation (SONA) speeches from 1994 to 2018 to analyse. The specific objectives are to:

  1. Infer sentiment and changes in sentiment over time
  2. Describe the topics that emerge
  3. Predict the President from a given sentence of text
  4. Evaluate the out-of-sample performance of the predictions

Approach

We collaborated using the following GitHub location: https://github.com/samperumal/dsi-assign2.

We initially split the work among ourselves, and each of us created a folder with our name to push our work to for the others to view.

We presented our work to each other and made suggestions for improvement. Before diving into any prediction, we felt it was important to do an Exploratory Data Analysis (EDA) to get a high-level overview of the dataset. This was done by Audrey.

The initial results from the Neural Net gave 63% accuracy on the validation set. To achieve higher accuracy, we attempted to feed the results of the Topic Modelling and Sentiment Analysis into the Neural Net. This meant we needed to understand the outputs of these two methods and the input format required by the neural net; getting the data into a usable format took some discussion and a few iterations.

Given the low accuracy of the neural net (NN), we tried a Convolutional Neural Net (CNN). Sam got the initial model working, Merve made improvements, and Vanessa tuned the hyperparameters. The result discussed in this document is a collaborative effort.

The CNN did not provide an improvement over the initial NN, and as a result we tried a Recurrent Neural Net (RNN). This takes in a sequence of data and gives importance to the order of the words in order to make a prediction.

Data Preparation

Initially we each performed our own import of the data, splitting out the year and president and tokenising, but we realised this duplicated effort, and the different naming conventions made it difficult to collaborate and use each other's output. In addition, Sam noticed that some of the data was not loaded due to special characters, and sentences were not being tokenised correctly for various reasons. He therefore became responsible for the data clean-up (preprocessing) and for outputting a .RData file that everyone could use to run our work.

The data as provided consisted of 30 text files, with filenames encoding the president's name, the year of the speech, and whether it was pre/post an election (absent in non-election years). In working through the files, we discovered that two files were identical; this was corrected in the data source with a replacement. In reading the files, we also identified 3 files with one or more bytes that caused issues for the standard R file IO routines. Specifically, 1 file had a leading Byte-Order-Mark (BOM), which is unique to Windows operating system files, and 2 other files contained invalid unicode characters, which suggests a speech-to-text application was used and experienced either transmission or storage errors. In all cases the offending characters were simply removed from the input files.

Having fixed basic read issues, we then examined the content of each file and the simplistic tokenisation achieved by applying unnest_tokens to the raw lines read in from the files. Several issues were uncovered, and in each case a regular expression was created to correct the issue in the raw lines.

Having fixed the text to allow correct sentence tokenisation, and applied the unnest_tokens function, we then determined a unique ID for each sentence by applying a hash digest function to the sentence text. This unique ID allowed everyone to work on the same data with confidence, and enabled us to detect 72 sentences that appeared identically in at least 2 speeches. As these duplicates would potentially bias the analysis and training, all instances of duplicates were removed from the dataset.
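In R we used a hash digest of the sentence text as the unique ID; the same idea can be sketched in Python (illustrative names, not our actual code):

```python
import hashlib
from collections import Counter

def sentence_id(sentence):
    """Stable unique ID: hex digest of the normalised sentence text."""
    return hashlib.sha1(sentence.strip().lower().encode("utf-8")).hexdigest()

def drop_cross_speech_duplicates(sentences):
    """Remove every instance of a sentence whose ID appears more than once.

    `sentences` is a list of (speech, text) pairs; all copies of a duplicate
    are dropped, mirroring our treatment of the repeated sentences.
    """
    counts = Counter(sentence_id(text) for _, text in sentences)
    return [(speech, text) for speech, text in sentences
            if counts[sentence_id(text)] == 1]

corpus = [
    ("1999_post", "We will build a better life for all."),
    ("2004_pre",  "We will build a better life for all."),  # duplicate: both dropped
    ("2010",      "Water infrastructure remains a priority."),
]
print(drop_cross_speech_duplicates(corpus))
```

Hashing the text rather than using a row index means the ID survives any reordering or re-splitting of the data, which is what made it safe for everyone to work on the same sentences independently.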

One final note is that each speech starts with a very similar boilerplate referencing various attendees of the SONA in a single, run-on sentence. We believe this header does not add significantly to the content of the speech, so we excluded all instances across all speeches.

The figure above shows the change in number of sentences per president after filtering. On the whole there are more sentences per president, with only a single reduction. Additionally, the highest increases are associated with the files where read-errors prevented us from previously reading the entire file. This change is equally evident in the boxplots below, which show the change in distribution per president of words and characters per sentence.

Overall there is a much tighter grouping of sentences, with less variation and more consistent lengths, which is useful for techniques that depend on equal-length inputs, such as some of the Neural Networks. The final histogram below shows the number of sentences per year/president after filtering; it retains the same basic shape as before filtering, but with a better profile.

Data Split and Sampling

For all group work, we separated our full dataset into a random sampling of 80% training and 20% validation data, which was saved into a common .RData file. This ensured consistency across the data we were working on, so that we could use each other's work and compare results consistently.

The graphs above make it clear that our data is also very unbalanced. In an attempt to correct for this, we applied oversampling with replacement to the training dataset to ensure an equal number of sentences per president. Training was attempted using both balanced and unbalanced data, but it did not appear to make much difference. Balancing was applied to the training dataset only, to ensure there are no duplicates in the validation set that might skew validation error.
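The split-and-balance step can be sketched as follows (a Python illustration; our pipeline did this in R and saved the result to a shared .RData file):

```python
import random

def split_and_balance(sentences, train_frac=0.8, seed=42):
    """80/20 split, then sample the training set with replacement so every
    president contributes the same number of sentences.
    `sentences` is a list of (president, text) pairs."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, valid = shuffled[:cut], shuffled[cut:]

    # group training sentences by president
    by_president = {}
    for pres, text in train:
        by_president.setdefault(pres, []).append((pres, text))

    # top every group up to the size of the largest, sampling with replacement
    target = max(len(group) for group in by_president.values())
    balanced = []
    for group in by_president.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced, valid
```

Note that only the training set is resampled; the validation set is left untouched, so no duplicated sentence can inflate the validation accuracy.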

Overview of the dataset

Each president has made a certain number of SONA speeches, depending on their term in office and whether there was 1 speech that year or 2 in the year of an election (pre and post election). Since the data depends on each president's term in office, it is unbalanced. Sentence counts per president after cleaning the data are:

## [1] "President sentence counts:"
## 
##   deKlerk   Mandela     Mbeki Motlanthe Ramaphosa      Zuma 
##       103      1879      2803       346       240      2697
## [1] "Baseline_accuracies"
## 
##   deKlerk   Mandela     Mbeki Motlanthe Ramaphosa      Zuma 
##  1.276648 23.289539 34.742191  4.288547  2.974715 33.428359
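The baseline accuracies above are simply each president's share of sentences, i.e. the accuracy achieved by always predicting that president. A quick Python check of the arithmetic:

```python
# sentence counts per president, from the output above
counts = {"deKlerk": 103, "Mandela": 1879, "Mbeki": 2803,
          "Motlanthe": 346, "Ramaphosa": 240, "Zuma": 2697}
total = sum(counts.values())
baseline = {p: 100 * n / total for p, n in counts.items()}

# a majority-class classifier always predicts Mbeki and is right ~34.7% of the time
majority = max(baseline, key=baseline.get)
print(majority, round(baseline[majority], 2))  # → Mbeki 34.74
```

Any model we train therefore needs to beat roughly 35% accuracy before it is doing better than always guessing Mbeki.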

Let us understand the number of words used by each President and how this varies across each SONA speech.

Average number of words used per President

We create a metric called "avg_words", which is simply the total number of words across all SONA speeches made by a particular president, divided by the number of SONA speeches that president made.

Average number of words used per President
president  num_words  num_speeches  avg_words
Mbeki          29952             9       3328
Motlanthe       3206             1       3206
Mandela        16801             6       2800
Zuma           23554             9       2617
Ramaphosa       2258             1       2258
deKlerk          783             1        783

On average, Mbeki used the most words in his SONA speeches, followed by Motlanthe; de Klerk used the least. Mandela and Zuma rank in the middle of their peers. The current president (Ramaphosa) used fewer words than all of his post-1994 peers.

Number of words used per SONA

Of the 3 presidents that have made more than 1 SONA speech, Mbeki used more words on average than both Mandela and Zuma and the variance in the number of words used per SONA speech is also higher for Mbeki. In 2004, which was an election year, the average number of words Mbeki used was lower in both his pre- and post-election speeches. Towards the end of his term, his average number of words also dropped off. The data suggests that perhaps Mbeki’s average number of words is correlated with his confidence in being re-elected President.

Common words used across all SONA speeches

Figure: Common bigram wordcloud

Common bigrams used across all SONA speeches

Figure: Common bigram wordcloud

Lexical Diversity per President

Lexical diversity refers to the number of unique words used in each SONA.

The number of unique words per SONA ranges from about 700 for de Klerk in 1994 to over 2500 with Mandela in his post election speech of 1999. Mbeki’s post election speech of 2004 and Zuma’s post election speech of 2014 reached close to the 2500 mark.

It's interesting that, whilst the number of unique words used trended upward for Mandela, Mbeki and Zuma both show an upward trend in the lead-up to the election year, followed by a downward trend after elections, despite nearing the 2500 unique-word mark in their post-election speeches.

If we exclude the post election speeches, the number of unique words used by Mbeki during his term from 2000 to 2008 averages just under 2000 whereas the number of unique words used by Zuma during his term from 2009 to 2017 averages just over 1500.

Lexical Density per President

Lexical density refers to the number of unique words used in each SONA divided by the total number of words; a low value is an indicator of word repetition.

De Klerk repeated over 30% of his words in his 1994 pre-election SONA speech. On average, Mandela repeated about 25% of words in each of his SONA speeches, and this reduced to about 20% in the post-election speech of 1999. Mbeki's repetition rate was about 23%, reducing to 20% in the post-election speech of 2004. Zuma's repetition rate is over 30%, with the exception of his post-election speech of 2014 at about 23%.
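On this reading, the repetition rate is the complement of the lexical density; a minimal Python sketch of the two quantities (toy data, not our actual figures):

```python
def lexical_stats(words):
    """Lexical density = unique words / total words; on this reading,
    the repetition rate quoted in the text is its complement."""
    density = len(set(words)) / len(words)
    return density, 1 - density

tokens = "we will build build a better better better life".split()
density, repetition = lexical_stats(tokens)
# 6 unique words out of 9 tokens: density 2/3, repetition 1/3
```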

Sentiment Analysis

Bing Lexicon Results

The “bing” lexicon encodes words as either “positive” or “negative”. However, not all words used in the SONA speeches are in the lexicon so we need to adjust for this.

Sentiment per President

Let's understand how many "positive" and "negative" words are used by each president across all their SONA speeches, and create a metric called "sentiment", which is simply the total number of positive words minus the total number of negative words. We then adjust for the total number of lexicon words used in the "sentiment_score" metric.

Sentiment Score per President
president  negative  positive  sentiment  sentiment_score
Zuma            788      1466        678            30.08
Ramaphosa       102       181         79            27.92
Mbeki          1314      2287        973            27.02
Motlanthe       180       263         83            18.74
Mandela        1036      1434        398            16.11
deKlerk          64        58         -6            -4.92
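Both metrics in the table can be reproduced directly from the counts; a Python sketch of the arithmetic:

```python
# (negative, positive) bing word counts per president, from the table above
bing = {"Zuma": (788, 1466), "Ramaphosa": (102, 181), "Mbeki": (1314, 2287),
        "Motlanthe": (180, 263), "Mandela": (1036, 1434), "deKlerk": (64, 58)}

# sentiment = positive - negative; sentiment_score adjusts for lexicon words used
scores = {p: {"sentiment": pos - neg,
              "sentiment_score": round(100 * (pos - neg) / (pos + neg), 2)}
          for p, (neg, pos) in bing.items()}
```

Dividing by the total lexicon words used (pos + neg) is what makes the score comparable across presidents with very different speech lengths.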

Of the 3 presidents that have made more than 1 SONA speech, Zuma has the highest sentiment score, followed by Mbeki and then Mandela. Zuma’s sentiment score is nearly double Mandela’s. It’s interesting that the current President, Ramaphosa, has the second highest sentiment score, not far behind Zuma and only slightly ahead of Mbeki.

What are the 10 positive words most frequently used by each president?

De Klerk's most used words were "freedom", "peaceful" and "support", and at least 2 of these 3 come up in every president's most used words. Mandela's most used words include "progress", "improve", "reconciliation" and "commitment", which are all words indicating repair and a move towards something better. Mbeki uses many of the same words but also introduces "empowerment", a word carried through by Zuma and Ramaphosa, and "success", which is carried through by Zuma. This is likely because Black Economic Empowerment (BEE) was introduced under Mbeki and was a policy carried through by Zuma and Ramaphosa. In addition, these words suggest progress in the move towards repair or something better, first spoken about by Mandela. Ramaphosa also introduces the words "confidence", "effectively", "enhance" and "efficient", which are words commonly seen in a business context and have not shown up in any other SA president's top 10 most frequently used words in a SONA since 1994.

Which of the positive words most frequently used are common across presidents?

Common positive words across post-1994 presidents include: "freedom", "regard", "support", "improve" and "progress". Words introduced by Mandela and unique to his speeches are: "restructuring", "reconciliation", "commitment", "contribution" and "succeed". Mbeki introduces the words "empowerment", "comprehensive", "integrated" and "improving" into the top words used, and this is unique to his speeches. Zuma uses the words "success", "reform" and "pleased" frequently where other presidents do not. Ramaphosa introduces the words "significant", "productive", "confidence" and "effectively", which have not yet been seen in any other SA president's top 10 most frequently used words in a SONA since 1994.

What are the 10 negative words most used by each president?


Common negative words pre 1994 include: “concerns”/“concern”/“concerned”, “unconstitutional”, “illusion”, “hopeless”, “disagree”, “deprive”, “conflict”, and “boycott”.

Common negative words post 1994 include: “corruption”, “crime”/“criminal”, “poverty”/“poor”, “inequality”, “issue”/“issues” and “crisis”.

A negative word introduced by and unique to Mandela’s top 10 is “struggle”. Mbeki is the only president with the word “racism” in his top 10 negative words. Motlanthe has “conflict” in his top 10 which no other president does. Zuma has “rail” which likely refers to the railway system and does have negative connotations for South Africa. Both Zuma and Ramaphosa use the word “difficult” a lot. Ramaphosa introduces the word “expropriation” into the top 10 for the first time amongst his peers.

How many of the negative words most used were used by each president?

The interpretation is much the same as before. Note the clear separation between the top 10 negative words used pre and post 1994 elections, indicative of the pre and post apartheid regimes.

What proportion of words used are positive vs negative?

The 2 vertical black lines are drawn at 60% and 70% positivity rates. In the majority of years, SONA speeches fall within this range of positivity; however, there are a few more negative speeches in earlier years and a few more positive speeches in later years.

Change in Sentiment over time

The trend appears to be more positive and less negative over time but how can we be sure?

We will test whether negative sentiment is increasing or decreasing, then whether positive sentiment is increasing or decreasing, using a Binomial model because the frequencies lie between 0 and 1. Finally, we will test whether average sentiment is increasing or decreasing using a linear model.

Is negative sentiment increasing over time?

## 
## Call:
## glm(formula = freq ~ as.numeric(year), family = "binomial", data = subset(sentiments_relative, 
##     sentiment == "negative"))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.21779  -0.05759   0.01224   0.07113   0.19692  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)       25.88105  115.18655   0.225    0.822
## as.numeric(year)  -0.01316    0.05743  -0.229    0.819
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.39087  on 24  degrees of freedom
## Residual deviance: 0.33826  on 23  degrees of freedom
## AIC: 27.484
## 
## Number of Fisher Scoring iterations: 3

The slope is negative, but the coefficient of the year variable is not significant, so we cannot conclude that negative sentiment is decreasing over time.

Is positive sentiment increasing over time?

## 
## Call:
## glm(formula = freq ~ as.numeric(year), family = "binomial", data = subset(sentiments_relative, 
##     sentiment == "positive"))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.19692  -0.07113  -0.01224   0.05759   0.21779  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -25.88105  115.18655  -0.225    0.822
## as.numeric(year)   0.01316    0.05743   0.229    0.819
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.39087  on 24  degrees of freedom
## Residual deviance: 0.33826  on 23  degrees of freedom
## AIC: 27.484
## 
## Number of Fisher Scoring iterations: 3

The slope is positive, but the coefficient of the year variable is not significant, so we cannot conclude that positive sentiment is increasing over time.

Is average sentiment increasing over time?

## 
## Call:
## glm(formula = avg_sentiment ~ as.numeric(year), family = "gaussian", 
##     data = sentiments_per_year)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -21.252   -7.877   -1.282    8.584   21.444  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      -1369.9476   652.4104  -2.100   0.0460 *
## as.numeric(year)     0.6952     0.3253   2.137   0.0425 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 153.4216)
## 
##     Null deviance: 4536.4  on 26  degrees of freedom
## Residual deviance: 3835.5  on 25  degrees of freedom
## AIC: 216.44
## 
## Number of Fisher Scoring iterations: 2

The slope is positive and the coefficient of the year variable is significant at the 5% level, so we can conclude that average sentiment is increasing over time.

But we need to be cautious with this interpretation: the "bing" lexicon has more than twice as many negative words as positive words, which could be influencing the results, and SONA speeches may in fact be more positive than they appear to be.

## 
## negative positive 
##     4782     2006

Distribution of “bing” Sentiment per President

Apart from the last 2 presidents, Ramaphosa and Zuma, the presidents are in time order. We can see that other than Motlanthe, the trend is an increasing average sentiment over time but at a decreasing rate. The interquartile range of Mbeki is smaller than Zuma’s which is smaller than Mandela’s.

Change in “bing” Sentiment over time

Average sentiment is the proportion of positive words out of all the words in the “bing” lexicon. Mandela shows a very erratic average sentiment, ranging from 0 to over 25. Mbeki and Zuma’s average sentiment mostly ranges between 25 and 50, with the exception of a few such as 2000, 2008, 2012, 2017.

Sentiment Analysis using “afinn” lexicon

The "afinn" lexicon scores words on a scale from -5 (most negative) to +5 (most positive).

afinn Sentiment
score     n  weighted_score
   -5     2             -10
   -4    27            -108
   -3   689           -2067
   -2  1059           -2118
   -1   867            -867
    1  2422            2422
    2  3525            7050
    3   379            1137
    4    42             168
    5    32             160

Most words are scored +2, followed by +1. This becomes even more pronounced when scores are multiplied by counts to get weighted scores. The distribution of all "afinn" words is as follows:

## 
##  -5  -4  -3  -2  -1   0   1   2   3   4   5 
##  16  43 264 965 309   1 208 448 172  45   5
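The weighted scores in the table are simply score × count; summing them gives the overall "afinn" balance for the corpus (a Python sketch):

```python
# word counts per afinn score, from the table above
afinn_counts = {-5: 2, -4: 27, -3: 689, -2: 1059, -1: 867,
                1: 2422, 2: 3525, 3: 379, 4: 42, 5: 32}

weighted = {score: score * n for score, n in afinn_counts.items()}
net = sum(weighted.values())  # positive overall: the +2 words dominate
```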

Words with a score of -2 dominate the lexicon, followed by words with a score of +2. The relatively high number of +2 words found in this analysis is therefore unlikely to be purely a result of their prevalence in the lexicon, and is probably an accurate reflection of the sentiment that prevails in the text.

Distribution of “afinn” Sentiment per President

The interpretation is much the same as with the “bing” lexicon in that the trend is an increasing average sentiment over time however Zuma’s median sentiment is lower than the general trend.

Change in “afinn” Sentiment over time

Mandela and Zuma show a wave-like pattern of sentiment. Mbeki shows an increasing and then decreasing pattern.

Sentiment Analysis using “nrc” lexicon

The "nrc" lexicon associates words with emotions.

The distribution of all “nrc” words is given by:

## 
##        anger anticipation      disgust         fear          joy 
##         1247          839         1058         1476          689 
##     negative     positive      sadness     surprise        trust 
##         3324         2312         1191          534         1231

Words can be assigned more than 1 sentiment, but we do not expect many words to come up under "anticipation", "joy" or "surprise", given their relatively low counts in the lexicon. "Anticipation" therefore has a surprisingly high relative count across all presidents.

Given that “positive” sentiment is the most frequent classification in the “nrc” lexicon, it is not surprising that it comes out as the most frequently assigned classification across all presidents. The distributions across the various sentiments are very similar for all presidents so this lexicon does not provide any insights about specific presidents.

The most used negative words that are also associated with the "anger", "disgust", "fear" and "sadness" emotions are: "violence", "struggle" and "poverty".

The most used positive words that are also associated with the "anticipation", "joy" and "surprise" emotions are: "youth", "public" and "progress".

The most used words that evoke the "trust" emotion are: "system", "president", "parliament" and "nation".

Analysis of Sentiment using Bigrams

Checking for Negation in Bigrams

It so happens that the 4 negation words are also stop words, so they have already been removed from the bigrams and need to be added back. This can be shown as follows:

##  [1] "not"     "not"     "not"     "no"      "no"      "no"      "never"  
##  [8] "never"   "without" "without"

Let’s redo the bigrams without removing stop words and see how many bigrams contain 1 of the negation words:

## # A tibble: 1 x 1
##       n
##   <int>
## 1   118

There are only 118 bigrams that contain negation words. Let’s look at a few examples:

bing Sentiment of Bigrams
year  word1    word2          sentiment1  sentiment2  president
1994  no       illusions      neutral     negative    deKlerk
1994  no       doubts         neutral     negative    deKlerk
1994  no       right          neutral     positive    deKlerk
1994  no       illusion       neutral     negative    deKlerk
1994  no       illusions      neutral     negative    deKlerk
1995  not      succeed        neutral     positive    Mandela
1997  not      falter         neutral     negative    Mandela
1997  not      shirk          neutral     negative    Mandela
1998  no       magic          neutral     positive    Mandela
1999  without  regard         neutral     positive    Mandela
2001  no       benefit        neutral     positive    Mbeki
2004  without  undue          neutral     negative    Mbeki
2006  not      wrong          neutral     negative    Mbeki
2006  not      dead           neutral     negative    Mbeki
2008  not      disappoint     neutral     negative    Mbeki
2009  not      lose           neutral     negative    Motlanthe
2009  not      detract        neutral     negative    Motlanthe
2009  without  undue          neutral     negative    Motlanthe
2009  not      suffer         neutral     negative    Motlanthe
2009  not      underestimate  neutral     negative    Motlanthe
2018  no       liberation     neutral     positive    Ramaphosa
2018  not      displace       neutral     negative    Ramaphosa
2009  not      falter         neutral     negative    Zuma
2009  not      backward       neutral     negative    Zuma
2014  not      sufficiently   neutral     positive    Zuma
2014  not      well           neutral     positive    Zuma
2017  not      worked         neutral     positive    Zuma

Let’s see how many there are per president:

Percentage of Bigrams with Negation Words
president   n  total  perc
deKlerk     5    237  2.11
Mandela    39   4700  0.83
Mbeki      34   9613  0.35
Motlanthe   7   1067  0.66
Ramaphosa   2    742  0.27
Zuma       31   9296  0.33

Given such a low percentage of bigrams with negation words, we do not expect them to significantly change the interpretation above, and recoding the sentiments is not justified.
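The negation check amounts to counting bigrams whose first word is a negation word; a minimal Python sketch (with stop words retained, as in the redo above):

```python
NEGATION_WORDS = {"no", "not", "never", "without"}

def negation_bigrams(tokens):
    """Bigrams (w1, w2) whose first word is a negation word."""
    return [(a, b) for a, b in zip(tokens, tokens[1:]) if a in NEGATION_WORDS]

example = "we will not falter and we have no illusions".split()
print(negation_bigrams(example))  # → [('not', 'falter'), ('no', 'illusions')]
```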

Topic Modelling

An effective topic model can summarise the ideas and concepts within a document, and this can be used in various ways. A user can understand the main themes within the corpus of documents and draw conclusions from analysis of these topics, or they can treat the topics as a form of dimensionality reduction and feed them into supervised or unsupervised algorithms.

In this project, our group has used topic modelling to better understand the common topics that come up over the SONA speeches, how these are related to different presidents and speeches and how they change over time. In addition, the probability that a sentence belongs to a certain topic was used in an attempt to classify which sentence was said by which president (see section on Neural Networks).

Data

The data used in this section is the clean, processed data described in the data preprocessing section above. The resulting sentence data has been used and dissected further without regard to the train/validation split, unless otherwise stated.

Methodology

The following methodology was followed:

  1. Each sentence was tokenised into bigrams, stop words were removed and a document-term matrix was set up. Bigrams were chosen over individual words as they provide more context and meaning.
  2. An optimisation technique was used to help determine the number of topics covered in the corpus, and this optimisation was validated on a hold-out sample.
  3. Latent Dirichlet allocation was used to determine the probability of bigrams belonging to certain topics and the probability that sentences belonged to topics.
  4. Text mining methods were deployed to extract insight into the different topics.
  5. The per-topic probabilities of each sentence were then passed through to a neural network.

Step 1: Tokenisation, Remove Stop words and Document Term Matrix

Figure: Most popular terms

After tokenisation and removal of stop words, the top 20 most used terms across all of the SONA speeches are displayed. Unsurprisingly, "South Africa" is the most used term, followed closely by "South African", "South Africans" and "Local Government". These terms do not add to our understanding of the topics and tend to confuse the topic modelling going forward; removing them allows for a cleaner interpretation. "Public service" is then the most used term.
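In outline, step 1 looks like the following (a Python illustration with a toy stop-word list; our actual pipeline was in R):

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a", "to", "and", "we", "will", "in", "our"}  # toy subset

def bigram_counts(sentence):
    """Tokenise, drop stop words, and count bigrams for one document."""
    tokens = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    return Counter(zip(tokens, tokens[1:]))

docs = ["We will grow the economy and create decent jobs",
        "Decent jobs and infrastructure development remain priorities"]

# document-term matrix: one bigram-count row per sentence
dtm = {doc_id: bigram_counts(text) for doc_id, text in enumerate(docs)}
```

The rows of `dtm` are what the LDA step consumes: each document becomes a sparse count vector over the bigram vocabulary.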

Step 2: Optimisation of k - the number of topics.

A pre-requisite of topic modelling is knowing the number of topics each corpus may contain (i.e. the latent factor k). In some cases this may be a fair assumption, but without reading through each speech, how could one know how many different topics have been articulated in the SONAs? Luckily, Murzintcev Nikita has published an R package (ldatuning) that helps to optimise the number of topics (k) over three different measures. The measures used to determine the number of topics are discussed in an RPubs paper (link), and the following optimisation largely follows the accompanying vignette: https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html

The following extract from the RPub paper gives a brief explanation of the methods used to optimise for k:

Extract from RPubs

"Arun2010: The measure is computed in terms of symmetric KL-Divergence of salient distributions that are derived from these matrix factor and is observed that the divergence values are higher for non-optimal number of topics (maximize)

CaoJuan2009: method of adaptively selecting the best LDA model based on density.(minimize)

Griffths: To evaluate the consequences of changing the number of topics T, used the Gibbs sampling algorithm to obtain samples from the posterior distribution over z at several choices of T(minimize)"

In addition to this, Nikita considers how the choice of k holds up over a validation or hold-out sample. His term for this is "perplexity", which he defines as: "[it] measures the log-likelihood of a held-out test set; Perplexity is a measurement of how well a probability distribution or probability model predicts a sample".

Below is an attempt to optimise for k and to check that the choice of k holds over an unseen data set.

Figure: Optimisation Metrics

From the above plot, the marginal benefit from adding another topic stops at around 5-10 topics. To test this, the "perplexity" of the document term matrix over a test sample can be checked.

Figure: Perplexity Plot

As more topics are used, the perplexity of the training sample does decrease but that of the test sample increases from around 11 topics. The perplexity of the test sample seems to be minimised at around 5 topics.

The evidence from these two plots suggests that the optimal number of topics is 5.

Step 3: Latent Dirichlet allocation

For this assignment, Latent Dirichlet allocation (LDA) was used for the topic modelling. Other methods, such as Latent Semantic Analysis (LSA) or Probabilistic Latent Semantic Analysis (pLSA), could have been used, but LDA is useful because it allows:

  1. Each document within the corpus to be a mixture of topics
  2. Each topic to be a mixture of bigrams
  3. The topics are assumed to be drawn from Dirichlet distribution (i.e. not k different distributions as with pLSA) so there are fewer parameters to estimate and no need to estimate the probability that the corpus generates a specific document.

Step 4: Extracting insights

Understanding the topics via the bigrams

The beta matrix gives the probability of a topic producing each bigram (i.e. that the phrase is in reference to that topic). From this measure, one can get a sense of the character of each topic: the most popular phrases in a topic give a feel for its flavour. However, bear in mind that terms can belong to more than one topic, so any theme assigned should be interpreted loosely.

Topic 1

From the display of popular terms, it can be determined that topic one has a vague connection to "job creation". This is the most common term, and it is supported by other terms that have a high probability of being generated by this topic, such as:

  • "world cup"
  • "national youth"
  • "infrastructure development"

These concepts all support the idea of job creation, as each will generate jobs for the country. But there is some noise in the topic from "address terms", i.e. "honourable speaker" or "honourable chairperson". "Nelson Mandela" and "President Mandela" crop up too, which suggests that alongside the job creation theme there exists some of what can be termed "terms of endearment".

Figure: Popular terms in Topic 1
Figure: WordCloud for Topic 1

Topic 2

As with the previous topic, there are some random "terms of endearment" in this topic (i.e. "madam speaker"), but they are not as evident as in the first topic. This is to be expected, as bigrams can be generated by more than one topic since each topic is a mixture of bigrams. The next four terms sum up the main themes for this topic:

  • “Economic Empowerment”
  • “Black Economic”
  • “Justice System”
  • “Criminal Justice”

This topic can be named "Economy / Criminal Justice System".

Figure: Popular terms in Topic 2
Figure: WordCloud for Topic 2

Topic 3

Despite the most popular terms being "United Nations" and "private sector", the theme that emerges is "development": development plan, resource development, national development, development programme, etc. Thus, the topic is named "Development".

Figure: Popular terms in Topic 3
Figure: WordCloud for Topic 3

Topic 4

Once again, there is a "term of endearment" in the popular terms ("fellow south", assumed short for "fellow South Africans", one of former President Zuma's favourite phrases). With all the other terms combined, a theme of "Social Reform/Regional and Municipal Government" takes shape.

Given that there is a possible trigram evident here, it may be worth exploring in future work.

Figure: Popular terms in Topic 4
Figure: WordCloud for Topic 4

Topic 5

"Public sector" and "private sector" are popular terms in topic 5. After consideration of the various other terms, some of which overlap with other topics, and some discussion, the eventual name for this topic became "Public Sector Entities".

Figure: Popular terms in Topic 5 Figure: WordCloud for Topic 5

A different way of looking at the topics is to investigate the biggest differentials in terms between topics. For instance, using the log (base 2) ratio of term probabilities between Topic 1 and Topic 5 shows the terms with the widest margin between the two topics (i.e. terms that are far more likely to appear in Topic 5 than in Topic 1).
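
The ratio itself is straightforward to compute from the per-topic term probabilities (the beta matrix of the LDA model). A minimal sketch in Python (the analysis itself was done in R, and the probability values below are invented for illustration, not taken from the fitted model):

```python
import numpy as np

# Hypothetical per-topic bigram probabilities (stand-ins for the LDA beta values)
bigrams = ["social programmes", "rights commission", "sector unions"]
topic1 = np.array([0.0001, 0.0002, 0.0040])  # P(bigram | topic 1)
topic5 = np.array([0.0060, 0.0050, 0.0001])  # P(bigram | topic 5)

# Positive => far more likely under topic 5; negative => under topic 1
log2_ratio = np.log2(topic5 / topic1)
for bigram, ratio in sorted(zip(bigrams, log2_ratio), key=lambda pair: pair[1]):
    print(f"{bigram}: {ratio:+.2f}")
```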

Figure: Comparison of biggest differential between Topic 1 and Topic 5 Bigram


For instance, “social programmes”, “human fulfilment” and “rights commission” are all generated in significantly larger proportions by Topic 5 compared to Topic 1, while “national social”, “training colleagues” and “sector unions” all sit within Topic 1.

Given the naming of Topic 5 as “Public Sector Entities” and Topic 1 as “Job Creation/Terms of Endearment” these terms do seem to be grouped in line with expectation.

Understanding the mixture of topics within the sentence

The LDA model allows each sentence to be represented as a mixture of topics. The gamma matrix shows the document-topic probability for each sentence, i.e. the probability that each sentence is drawn from that topic. For instance, the following sentence, sampled at random, has a probability of 0.905 of being drawn from Topic 4 based on the bigrams within it. The sentence appears to be talking about water and the infrastructure around it. The label for Topic 4 was “Social Reform/Regional and Municipal Government” and this statement seems relevant to it.

Sample sentence showing topic probabilities:

president: Zuma; year: 2010
sentence: “yet, we still lose a lot of water through leaking pipes and inadequate infrastructure.”
X1 = 0.023575, X2 = 0.023575, X3 = 0.023575, X4 = 0.9057001, X5 = 0.023575

Using this method, the sentences can be roughly classified to a topic based on the probabilities (i.e. classify the sentence by the topic with the highest probability) and further analysis can be conducted.

(Note: which.is.max breaks ties at random, so where a sentence has equal probabilities it will be assigned to a topic at random.)
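
For illustration, the classification rule is a simple row-wise argmax over the gamma matrix; here is a Python sketch mirroring the tie-at-random behaviour of R’s which.is.max (0-based indices here), applied to the gamma row of the sampled Zuma sentence above:

```python
import numpy as np

rng = np.random.default_rng(0)

def which_is_max(probs):
    """Index of the largest value, breaking ties at random
    (mirrors the behaviour of R's which.is.max, but 0-based)."""
    probs = np.asarray(probs)
    ties = np.flatnonzero(probs == probs.max())
    return rng.choice(ties)

# Gamma row for the sampled Zuma sentence above
gamma = [0.023575, 0.023575, 0.023575, 0.9057001, 0.023575]
print(which_is_max(gamma) + 1)  # topic 4
```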

Figure: Probability of belonging to each Topic, by President


Consider the mixture of topics that each individual president covers during the SONA address. Despite the imbalance in the number of sentences spoken by each president, there seems to be a fairly standard shape to the topics discussed. The two exceptions to this are de Klerk and Zuma. All other presidents tend to spend around 10-15% on Topic 1 (“Job Creation/Terms of Endearment”), 15-20% each on Topic 2 (“Economy/Criminal and Justice System”), Topic 3 (“Development”) and Topic 4 (“Social Reform/Regional and Municipal Government”), and around another 10% on Topic 5 (“Public Sector Entities”). This uniformity means it may be difficult for a supervised model to pick up differences between presidents based on the topics covered.

As stated, the only two presidents for whom this trend differs are President de Klerk and President Zuma. President de Klerk spent the majority of his time on Topic 1 (“Job Creation/Terms of Endearment”), followed by Topic 2 (“Economy/Criminal and Justice System”). Given the context of the time period, it may be unsurprising that “terms of endearment” and the “criminal and justice system” come up, since his speeches would be littered with names of people and political parties, as well as discussion of past injustices.

President Zuma spends the majority of his speeches on Topic 4 (“Social Reform/Regional and Municipal Government”). Once again, given that his term as President was marked by service delivery strikes, two major droughts over a few different regions and discussions around reform, this may be unsurprising. In fact, recalling the most popular term from Topic 4 (“fellow south”), it may even be predictable that this would be the most talked-about topic for President Zuma. What is interesting is that, given the attention to the issues of State Capture that characterised Zuma’s presidency, his coverage of Topic 5 (“Public Sector Entities”) is much smaller than that of his peers.

A similar analysis can be taken over time.

Figure: Change in topic composition over time, per president


The graph shows that over time, Topics 1 and 5 are the least discussed topics while Topics 2, 3 and 4 all get much the same airtime. There are a number of notable spikes/valleys:

  • In 1996, Topic 2 (“Economy/Criminal and Justice System”) spikes.
    • The 1996 SONA was a few months ahead of the introduction of the new constitution, as well as at the time of the start of the Truth and Reconciliation Commission. It could be suggested that these two events drove this topic up in the SONA speech.
  • In 2005, Topic 1 (“Job Creation/Terms of Endearment”) dives while Topic 4 (“Social Reform/Regional and Municipal Government”) and Topic 2 (“Economy/Criminal and Justice System”) spike considerably.
    • Mbeki’s presidency (1998-2008) was characterised by a rise in crime, specifically farm attacks, as well as the HIV/AIDS epidemic and the start of Black Economic Empowerment in 2005, which could account for the spikes and drops of topics in 2005.
  • In 2012, Topic 2 (“Economy/Criminal and Justice System”) dives considerably.
    • From various media reports, Zuma’s 2012 SONA speech largely covered the successes of the government while skipping over future plans. That may be why Topic 4 (“Social Reform/Regional and Municipal Government”) rises sharply.

Step 5: Using the topic to predict the president

One of the aims behind topic modelling is to reduce the dimensionality of the data to allow other techniques to be applied. In this instance, the aim was to reduce the SONA speeches to a collection of topics that would help predict which president was responsible for a sentence in the SONA speech. The assumption was that each president might have a unique set of topics, or mixture of topics, that could characterise their particular speech. However, there does not seem to be evidence of this. The matrix of probabilities of each sentence belonging to each topic is used in the Neural Nets with Topic Modelling section, where the results are discussed.

Neural Nets

Neural Net with Bag of Words Data

The input to our Neural Networks is the frequency of each word within the sentence. To accomplish this, we unnest the sentence data, count each word in each sentence and spread the word counts so that each row is a sentence id and each column is a word. This simple neural network model was our first attempt.
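
The unnest, count and spread steps can be sketched as follows (an illustrative Python stand-in for the R/tidytext pipeline actually used; the sentences are made up):

```python
from collections import Counter

sentences = [
    "we must create jobs",
    "jobs and growth must rise",
]

# "Unnest": split sentences into words, then build one column per distinct word
vocab = sorted({word for s in sentences for word in s.split()})

# "Count" and "spread": one row per sentence id, cell = count of that word
rows = []
for s in sentences:
    counts = Counter(s.split())
    rows.append([counts.get(word, 0) for word in vocab])

print(vocab)
print(rows)
```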

Figure: Bag of words model - neural net training performance


This model has L2 regularization to avoid overfitting, but even so it did not help very much. The accuracy is 0.55886. The RMSprop optimizer uses a learning rate of 0.003, chosen after trying lr = c(0.001, 0.002, 0.003); for readability, only the model with the best learning rate is shown.

As we can see from the plot, the model overfits after the second iteration, since the loss function starts increasing in value. To avoid that, we attempted a smaller model with fewer neurons and added dropout.

Figure: Bag of words smaller model - neural net training performance


Confusion matrix for the smaller word-count model:

        1    2    3   4   5   6
   1  180   33   56   7   4   0
   2   50  181   26   4   9   0
   3   57   16  111   3   1   0
   4   14    8   10   3   0   0
   5    7   12    4   0   1   0
   6    1    2    3   3   0   1

Accuracy rate is: 0.5911.

Cohen’s Kappa

The Kappa value tells you how much better your classifier is performing than simply guessing at random according to the frequency of each class.
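
Concretely, kappa compares the observed agreement (the diagonal of the confusion matrix) with the agreement expected by chance from the row and column marginals. A minimal sketch (illustrative Python with a made-up 2x2 matrix):

```python
import numpy as np

def cohens_kappa(confusion):
    """Cohen's kappa from a confusion matrix (rows: predicted, cols: actual)."""
    cm = np.asarray(confusion, dtype=float)
    n = cm.sum()
    p_observed = np.trace(cm) / n
    # Chance agreement implied by the marginal class frequencies
    p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Toy example: 85% observed agreement, 50% expected by chance -> kappa 0.7
print(cohens_kappa([[40, 10], [5, 45]]))
```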

“Cohen’s kappa is always less than or equal to 1. Values of 0 or less, indicate that the classifier is useless. There is no standardized way to interpret its values. Landis and Koch (1977) provide a way to characterize values. According to their scheme a value < 0 is indicating no agreement, 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect agreement.” [Reference: Landis, J.R.; Koch, G.G. (1977). “The measurement of observer agreement for categorical data”. Biometrics 33 (1): 159-174]

The Kappa value is 0.416, which means we are doing moderately better than random.

The accuracy is slightly better than the bigger model with no dropout (0.581). For a simple word-count model this seems good enough, but it does not consider how important each word is to its corpus, so we should try a better representation.

After the fourth iteration validation loss starts increasing which is a sign of overfitting.

Neural Net with tf-idf Data

TF-IDF is a statistic that shows how important a word is in the context of its corpus. Feeding the NN with TF-IDF, we expect the results to be slightly better than the word-count NN model.
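
As a sketch of the statistic itself (illustrative Python with invented sentences; the report computes tf-idf in R): each word’s within-document frequency is scaled by the log of its inverse document frequency, so words that appear in every document score near zero.

```python
import math
from collections import Counter

docs = [
    "honourable speaker jobs jobs growth",
    "honourable speaker criminal justice",
    "development plan development programme",
]

def tf_idf(docs):
    """tf-idf scores per document: (count / doc length) * log(N / doc freq)."""
    n = len(docs)
    tokenised = [d.split() for d in docs]
    df = Counter(word for doc in tokenised for word in set(doc))
    scores = []
    for doc in tokenised:
        tf = Counter(doc)
        scores.append({w: tf[w] / len(doc) * math.log(n / df[w]) for w in tf})
    return scores

scores = tf_idf(docs)
# "jobs" is rare across documents while "honourable" is common, so:
print(scores[0]["jobs"] > scores[0]["honourable"])  # True
```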

Confusion matrix for the TF-IDF model:

        1    2    3   4   6
   1  183   46   51   0   0
   2   48  203   18   1   0
   3   52   17  118   1   0
   4   16    8   11   0   0
   5    5   16    3   0   0
   6    0    3    4   2   1
Figure: TF-IDF model - neural net training performance


Accuracy rate is: 0.6245.

Cohen’s Kappa

The accuracy is 0.6158 and the model starts overfitting after the fourth epoch, so this is slightly better than the bag-of-words word-count model, as we expected.

Neural Net with Sentiment Analysis Data

Confusion matrix for the sentiment analysis model:

        1    2   3   4   5   6
   1  143  134   3   0   0   0
   2   70  199   1   0   0   0
   3   91   95   2   0   0   0
   4   15   20   0   0   0   0
   5    8   15   1   0   0   0
   6    3    7   0   0   0   0
Figure: Sentiment Analysis - neural net training performance


Accuracy of sentiment analysis is: 0.4263.

Cohen’s Kappa: 0.1323447

Sentiment analysis also reaches its smallest validation loss on the fifth iteration, but the train accuracy and test accuracy change only slightly at each iteration. This model does not seem to be doing well on either the training set or the test set: the test accuracy is 0.4361834 and the training accuracy is 0.4353395. Looking at the NRC sentiment lexicon, it is clear that all presidents share the same sentiment distribution pattern, which is why the model is not overfitting: our set-aside test set is effectively no different from the training set.

Neural Nets with Topic Modelling Data (Gamma Values)

Topic modelling only predicted presidents 1 (Mbeki) and 2 (Zuma).

Figure: Topic Modelling - neural net training performance


The train and test sets are not very distinct from each other, just as with sentiment analysis. If we look at the mixture of topics by president in the topic modelling section, we can see that the topic distributions for each president are fairly uniform, making it hard to separate one president’s topics from another’s.

Cohen’s Kappa: 0

Sequential Neural Networks

The bag-of-words model as applied to Neural Networks treats each sentence as an unordered list of integer or one-hot-encoded elements. This captures whether a word occurs in a sentence, and the frequency of occurrence for tf-idf models. While this can be effective, it does ignore any signals in the data related to the ordering and relative positions of words. Sequential neural networks address this problem by treating the data as an ordered list of integers using a dictionary that provides a unique mapping between words and integers. The network then applies various layers to this input that attempt to extract the sequential information for use in later standard layers.

For all our sequential neural network attempts, we converted each sentence to a vector of integers using a word dictionary as our x-data, and one-hot-encoded the presidents as our y-data.
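
These two conversions can be sketched as follows (an illustrative Python version with made-up sentences; index 0 is reserved for padding, and the real work used a proper tokeniser):

```python
sentences = ["we must create jobs", "fellow south africans"]
presidents = ["Mbeki", "Zuma"]

# Word dictionary: a unique integer per word, with 0 reserved for padding
vocab = {word: i + 1
         for i, word in enumerate(sorted({w for s in sentences for w in s.split()}))}

MAXLEN = 5  # fixed sentence length for the network input

def to_sequence(sentence):
    seq = [vocab[w] for w in sentence.split()]
    return seq + [0] * (MAXLEN - len(seq))  # right-pad with zeros

labels = sorted(set(presidents))

def one_hot(president):
    return [1 if president == label else 0 for label in labels]

x = [to_sequence(s) for s in sentences]  # integer vectors
y = [one_hot(p) for p in presidents]     # one-hot encoded presidents
print(x)
print(y)
```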

Embeddings

An embedding layer is a dimensionality reduction technique that attempts to encode the relationships between words in a sentence (input as a variable length integer array with padding) as a fixed-length floating point vector. An embedding has a tunable hyper-parameter for the number of latent factors to map every sentence on to, where each latent factor attempts to capture a semantic dimension of the sentence as a whole. Embeddings aim to capture the linear substructure of sentences through Euclidean distances between words in the n-dimensional unit hypercube, where n is the number of latent factors specified.

Embeddings can be trained on the corpus of sentences that comprise the dataset under investigation, however this can prove limiting if there is a relatively small quantity of training data. An alternative approach is to re-use a previously trained embedding layer, such as the GloVe embedding. This has the advantage of leveraging the results from a much larger, and theoretically more generic, dataset in an application of transfer learning. The SONA data includes a large number of non-standard or foreign words, however, which theoretically limits the applicability of pre-trained embeddings.
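
Mechanically, an embedding layer is a trainable lookup table: each integer word index selects one row of a (vocabulary size x latent factors) weight matrix. A minimal numpy sketch (the weights here are random stand-ins for trained values):

```python
import numpy as np

rng = np.random.default_rng(42)

vocab_size, n_factors = 10, 4  # n_factors = number of latent factors
embedding = rng.normal(size=(vocab_size, n_factors))  # learned during training

# A padded sentence of word indices maps to a (length x n_factors) matrix
sentence = np.array([7, 5, 2, 4, 0])  # index 0 = padding
vectors = embedding[sentence]
print(vectors.shape)  # (5, 4)
```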

Convolutional Neural Network (CNN)

A convolutional layer applies a moving weighted-average filter (kernel) over the input data that attempts to extract simple patterns for use in later layers. They are an approach to reducing the dimensionality of input data by using a shared weighting across all input nodes, thereby addressing the exploding/vanishing gradient problem that would otherwise occur with a standard fully-connected layer. By way of example, a 100-node input layer followed by a 50-node fully connected layer would have 5000 weights to fit, whereas with an equivalent convolutional layer there would only be 50 weights.

The convolutional layer has a number of tunable hyper-parameters:

  • The number of filters/nodes to train - more filters can capture more granular patterns in the data, but with diminishing returns.
  • The size of the kernel to be applied to the input, which determines how many adjacent data points are input to the convolution operator.
  • The padding, if any to be used, which determines how to handle data at the boundaries of the input.
  • The activation function applied to the output of the convolution operator.
  • The stride of the kernel, which can be used to avoid overfitting at the expense of potentially missing important patterns.
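
The mechanics of a single filter can be sketched by hand (an illustrative numpy version; the actual model uses a keras convolutional layer): the same kernel weights slide across the input at the given stride, producing a feature map that later layers consume.

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Valid (no padding) 1-D convolution: a moving weighted sum over x."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(0, len(x) - k + 1, stride)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([-0.5, 0.0, 0.5])  # one filter of kernel size 3

feature_map = conv1d(x, kernel)         # detects a rising trend in each window
activated = np.maximum(feature_map, 0)  # "relu" activation
pooled = activated.max()                # global max pooling
print(feature_map, pooled)
```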

Chosen structure

We experimented with a number of different network topologies, and finally settled on the following:

  • An embedding layer with 70 latent factors trained on the full word dictionary across all speeches.
  • A dropout layer set to randomly exclude 50% of inputs from the previous layer on each iteration, to prevent overfitting.
  • A convolutional layer with 50 filters, a kernel size of 3, and stride of 1
  • A global max pooling layer, which assists in dimensionality reduction
  • A fully connected dense layer with 128 nodes
  • A dropout layer set to randomly exclude 50% of inputs from the previous layer on each iteration, to prevent overfitting.
  • An activation layer using the “relu” function
  • A fully connected dense layer with 6 nodes, to map to our output encoding
  • An activation layer using the “softmax” function, to produce output consistent with our one-hot-encoding of presidents.

Deep CNN

We also experimented with Deep CNN architectures by adding additional densely connected layers (with accompanying dropout and activation layers) below the convolutional layer. Despite much experimentation, the additional layers did not appear to have any noticeable effect on the accuracy of our results, so these have been omitted from this document.

CNN with Transfer Learning (Pre-trained Embeddings)

We will be using GloVe embeddings. GloVe stands for “Global Vectors for Word Representation” and, as stated on its website, is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. [Reference: Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation: https://nlp.stanford.edu/pubs/glove.pdf] Specifically, we will use the 100-dimensional GloVe embeddings of 400k words computed on a 2014 dump of English Wikipedia.

The note below is taken from the R implementation referenced above; as it states, the accuracy achieved by the Python version is twice as good.

“IMPORTANT NOTE: This example does not yet work correctly. The code executes fine and appears to mimic the Python code upon which it is based, however it achieves only half the training accuracy that the Python code does, so there is clearly a subtle difference. We need to investigate this further before formally adding to the list of examples.”

[Reference for the implementation in Python: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html]
[Reference for the implementation in R: https://keras.rstudio.com/articles/examples/pretrained_word_embeddings.html]
[Reference for the implementation in R: https://github.com/rstudio/keras/blob/master/vignettes/examples/pretrained_word_embeddings.R]

Also, for pre-trained embeddings to work well, they need to be trained on data similar to what you are trying to classify. The GloVe embeddings are trained on Wikipedia data, so one can expect that they would not necessarily help predict the presidents better for our sentences.

Figure: Glove transfer learning - neural net training performance


Confusion matrix for the GloVe transfer learning model:

        1    2    3   4   5   6
   1  286  104  156  31  11   2
   2  183  350  114  24  31   6
   3   79   58   93  11   5   7
   4   12   12    9   0   0   0
   5    6    5    4   0   0   1
   6    3    5    5   0   0   1
  • The majority of Mbeki’s sentences are predicted as Mandela (379/569).
  • The majority of Zuma’s sentences are predicted as Mandela (244/534), with the second largest share predicted as himself (195/534).
  • The majority of Mandela’s sentences are predicted as Mandela (263/381).
  • The majority of Motlanthe’s sentences are predicted as Mandela (38/66).
  • The majority of Ramaphosa’s sentences are predicted as Mandela (19/47) or Zuma (15/47).
  • The majority of de Klerk’s sentences are predicted as Mandela (14/17).

Cohen’s Kappa

With a Kappa value of 0.2036538, we are doing only slightly better than a random classifier (fair agreement on the Landis and Koch scale).

Recurrent Neural Network (RNN)

A Recurrent Neural Network attempts to model the relationship between words in a sentence based on their relative positions. It involves repeatedly applying the same layer to each word of a sentence (rather than to the sentence as a whole), in a manner that allows the layer to “remember” aspects of the words already seen. For our application we used a long short-term memory (LSTM) layer, which trains both the weights and the memory of the layer.
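
The recurrence idea can be sketched with a plain (non-LSTM) recurrent step (illustrative numpy with random weights; a real LSTM adds gates that control what the layer remembers and forgets):

```python
import numpy as np

rng = np.random.default_rng(0)

n_factors, n_hidden = 4, 3
W = rng.normal(scale=0.1, size=(n_hidden, n_factors))  # input -> hidden weights
U = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # hidden -> hidden weights

def rnn_forward(word_vectors):
    """Apply the same layer to each word in order; the hidden state h
    carries a "memory" of the words already seen."""
    h = np.zeros(n_hidden)
    for x in word_vectors:
        h = np.tanh(W @ x + U @ h)  # depends on the current word AND history
    return h  # the final state summarises the whole sentence

sentence = rng.normal(size=(5, n_factors))  # five embedded words
print(rnn_forward(sentence))
```

Because the state is updated word by word, reversing the sentence gives a different final state, which is exactly the order sensitivity the bag-of-words models lack.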

Chosen structure

We experimented with a number of different network topologies, and finally settled on the following:

  • An embedding layer with 70 latent factors trained on the full word dictionary across all speeches.
  • A dropout layer set to randomly exclude 50% of inputs from the previous layer on each iteration, to prevent overfitting.
  • A convolutional layer with 50 filters, a kernel size of 3, and stride of 1
  • A global max pooling layer, which assists in dimensionality reduction
  • A fully connected dense layer with 128 nodes
  • A dropout layer set to randomly exclude 50% of inputs from the previous layer on each iteration, to prevent overfitting.
  • An activation layer using the “relu” function
  • A fully connected dense layer with 6 nodes, to map to our output encoding
  • An activation layer using the “softmax” function, to produce output consistent with our one-hot-encoding of presidents.

Results

RNNs have been shown to achieve very good performance when applied to Natural Language Processing (NLP) of text, due to the similarities with how humans process language. Unfortunately, our applied RNN did not surpass the performance of the other networks we attempted, despite many attempts at tuning. We suspect the sparsity of the dataset played a large role in this result, as it was replicated on both the balanced and unbalanced training data.

Neural Network Results Overview

Model                Train Accuracy   Test Accuracy
Bigger word-count    0.99             0.60
Smaller word-count   0.95             0.59
TF-IDF model         0.87             0.63
Sentiment Analysis   0.43             0.43
Topic modelling      0.36             0.35
Transfer Learning    0.88             0.45
CNN                  0.99             0.61
RNN                  0.35             0.35

Conclusions

The sentiment and topic models give some insight into the SONA speeches with regard to the general feeling and themes discussed. Disappointingly, they do not give enough flavour of the man behind the presidency – perhaps they shared speech writers?

Predicting which president said a sentence has had only moderate success despite many alternative approaches: our best validation accuracy could not surpass the 63% mark. Considering the imbalanced nature of the data and the fact that Zuma dominates the number of sentences, this may be the best we can hope to achieve. However, it is still almost four times better than randomly picking one of the six presidents, which would only give an accuracy of 16.67%.

References

https://www.kaggle.com/rtatman/tutorial-sentiment-analysis-in-r

https://www.datacamp.com/community/tutorials/sentiment-analysis-R

https://nlp.stanford.edu/pubs/glove.pdf

https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

https://keras.rstudio.com/articles/examples/pretrained_word_embeddings.html

https://github.com/rstudio/keras/blob/master/vignettes/examples/pretrained_word_embeddings.R

Landis, J.R.; Koch, G.G. (1977). “The measurement of observer agreement for categorical data”. Biometrics 33 (1): 159-174

Jeffrey Pennington, Richard Socher, and Christopher D. Manning (2014). “GloVe: Global Vectors for Word Representation”. In Empirical Methods in Natural Language Processing (EMNLP). http://www.aclweb.org/anthology/D14-1162